Checking the versions

Loading the data
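Loading the data can be sketched as follows. The file name and the two sample rows are illustrative stand-ins (an inline CSV keeps the snippet self-contained); in practice this would be `pd.read_csv("Fraud_Data.csv", ...)` on the real file.

```python
import io
import pandas as pd

# Illustrative two-row sample standing in for the real CSV file; the column
# layout follows the dataset described below, but the values are placeholders.
sample_csv = io.StringIO(
    "user_id,signup_time,purchase_time,purchase_value,device_id,source,"
    "browser,sex,age,ip_address,class\n"
    "22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,"
    "Chrome,M,39,732758368.8,0\n"
    "333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,"
    "Chrome,F,53,350311387.9,0\n"
)
# parse_dates turns the two timestamp columns into proper datetimes up front.
df = pd.read_csv(sample_csv, parse_dates=["signup_time", "purchase_time"])
print(df.shape)
print(df.isnull().sum().sum())  # 0 -> no missing values in this sample
```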

Let's go over what each column means:
user_id: The user ID assigned to new users
signup_time: Time of account creation
purchase_time: Time of the first purchase
elapsed_time: Time taken, in months, to make the first transaction
purchase_value: Amount spent on the purchase
device_id: The device ID, unique per device
source: The user's marketing channel, such as Direct, SEO, or Advertisement
browser: The browser used by the user
sex: Gender of the user
age: Age of the user
ip_address: IP address of the device used
class: Whether the transaction is fraudulent, 0 for non-fraudulent and 1 for fraudulent
country: Country of the user

Luckily, we don't have any missing values, so we can move straight into exploring the data.

EDA

Here, let's start with checking the imbalance of the labels,
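On the real dataframe this check would simply be `df["class"].value_counts()`; here the series is reconstructed from the counts reported below so the snippet stands alone.

```python
import pandas as pd

# Rebuild the label column from the counts reported in the text
# (108735 non-fraud vs 11265 fraud) and compute the imbalance.
labels = pd.Series([0] * 108735 + [1] * 11265, name="class")
counts = labels.value_counts()
fraud_rate = counts[1] / counts.sum()
print(counts)
print(f"Fraud rate: {fraud_rate:.2%}")  # about 9.39%
```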

Here, 0 means the transaction was not fraudulent and 1 means it was. We have 108735 non-fraud records vs 11265 fraud records, i.e. about 9.4% fraud. The dataset is imbalanced, though not as severely as some credit card fraud detection datasets.

Let's visualize the impact of different variables

Now, let's analyze the relationship between user age and frauds committed.

Here, we can see that most frauds are committed by users aged between 30 and 45, largely because that age bracket contains the most users; the fraud distribution roughly mirrors the overall age distribution.

Above, the distributions are broadly similar, though users who come to the website directly commit slightly more fraud relative to their overall share of visitors. The difference is not significant.

We can see from the above that Chrome is the most used browser overall, and hence also the most common browser in fraudulent transactions.

Here as well, the distributions are quite similar: most purchases are low-value, and fraudulent transactions do not involve noticeably larger amounts.

Now, to analyze timing, we need to convert purchase_time and signup_time into something more meaningful.
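The conversion can be sketched with pandas' `.dt` accessor; the two timestamps below are illustrative, and the derived column names (purchase_hour, elapsed_days, etc.) are my own labels, not necessarily the ones used in the notebook.

```python
import pandas as pd

# Two illustrative signup/purchase timestamp pairs.
df = pd.DataFrame({
    "signup_time": pd.to_datetime(["2015-02-24 22:55:49", "2015-06-07 20:39:50"]),
    "purchase_time": pd.to_datetime(["2015-04-18 02:47:11", "2015-06-08 01:38:54"]),
})
# Extract hour-of-day and day-of-month features, plus the signup-to-purchase gap.
df["purchase_hour"] = df["purchase_time"].dt.hour
df["purchase_day"] = df["purchase_time"].dt.day
df["signup_hour"] = df["signup_time"].dt.hour
df["elapsed_days"] = (df["purchase_time"] - df["signup_time"]).dt.days
print(df[["purchase_hour", "purchase_day", "signup_hour", "elapsed_days"]])
```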

Above, we can see that most fraudulent purchases are made around 9 am and 5 pm.

Here, we can see that fraudulent signups also peak around 9 am and 5 pm, mirroring the purchase-time trend.

Surprisingly, we have an extremely low number of fraudulent purchases at the end of the month.

Signup dates show a similar trend to purchase dates.

Here, it is obvious that most fraudulent transactions occur within 30 days of signing up. Thus, this can be an important feature for us.

Now, this is enough for EDA. Let's move on to feature selection, where we will use the knowledge gained during this process.

Feature Selection

Here, we first drop the user_id column, as we don't really get any insight from it. Moreover, since the dataset contains only each user's first transaction, there is exactly one record per user; we can verify below that each user_id appears only once.

Now, let's try to convert the categorical variables into something we can use in our classification model.

First, let's try to analyze the country:

Also, let's analyze frauds by country.

Here, we have a choice to make:
1) We can drop the country column, since the ratio of frauds to users is similar across most countries, so we would not lose much information
2) We can one-hot encode or hash this categorical variable and use it in model training
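Option 2 can be sketched with `pandas.get_dummies` (scikit-learn's `OneHotEncoder` would work equally well); the country values below are illustrative.

```python
import pandas as pd

# One-hot encode a toy country column: each distinct value becomes
# its own 0/1 indicator column.
df = pd.DataFrame({"country": ["United States", "Japan", "Unknown", "Japan"]})
encoded = pd.get_dummies(df["country"], prefix="country")
print(encoded.columns.tolist())
print(encoded.shape)  # (4, 3): one column per distinct country
```

With ~180 countries in the real data this produces ~180 columns, which is exactly the cardinality concern discussed below.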

However, before dropping the column, we should note that some countries, such as Saudi Arabia, Sri Lanka, and other Middle Eastern countries, have a higher fraud ratio than others.

Let's apply one-hot encoding to it and try to analyze its importance.

Here, Unknown is the country with the second-highest number of reported fraud cases. We have another decision to make:
1) Do we drop the rows with Unknown country?
2) Do we include Unknown in the hashing?
Since dropping the rows with an Unknown country would leave us with limited data, we can't choose that option. Because the country name is derived from the IP address, some IPs are simply hard to map to a country. So let's treat Unknown as just another value.

Above, we can see that countries with a higher ratio of frauds have higher feature importance. Countries like Ireland, Ecuador, Luxembourg, etc. have a higher ratio than other countries, so they are more relevant.

So, in my opinion, we should keep the country column, since we are building a generalized model that will see users from all countries.

We have to keep in mind that columns with high cardinality can cause tree-based algorithms some trouble. So let's come up with a way to keep the country column while dealing with the cardinality issue in RandomForest.

Since we are using RandomForest, we could also use Boruta for feature importance, but as we are restricting ourselves to scikit-learn, we will rely on the results we got above.

Let's try Hashing now.

Hashing

Here, we can see that the hashed variables have higher feature importance than some other variables, so we can use this transformation for the categorical column.
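The hashing step can be sketched with scikit-learn's `FeatureHasher`, which maps an arbitrary number of categories onto a fixed number of columns; `n_features=8` is an illustrative choice, not necessarily the value used here.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash a toy country column into 8 fixed columns, sidestepping the
# high-cardinality problem of one-hot encoding.
countries = ["United States", "Japan", "Unknown", "Japan"]
hasher = FeatureHasher(n_features=8, input_type="string")
# With input_type="string", each sample is an iterable of strings.
hashed = hasher.transform([[c] for c in countries]).toarray()
print(hashed.shape)  # (4, 8) regardless of how many distinct countries exist
```

The trade-off is that hash collisions can map different countries to the same column, but the column count stays bounded no matter how many countries appear.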

Let's drop the country column and replace it with the hashed variables, and replace sex, source, and browser with one-hot encoded features.

Now that we know hashing can be useful, we can use it along with other features to compare.

Let's split the data into train and test sets and start preparing our features.
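A minimal sketch of the split, assuming a stratified hold-out; `stratify=y` keeps the fraud ratio identical in train and test, which matters with imbalanced labels. The array sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix with a 10% positive class, mimicking the imbalance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)
# Stratified split preserves the 10% fraud rate in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(y_train.sum(), y_test.sum())  # 8 and 2 positives respectively
```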

Now, let's analyze the device_id and ip_address next.

Here, we can see that the Unknown country shows up as well; let's see whether we can do something about that.

The table above shows that the probability of fraud is highest when an ip_address has been used before. We need to convert this into a feature we can use to train our model.

First, let's create a dictionary counting how many times each particular IP address is used. We will need this dictionary to compute the same counts for the test data as well.
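The counting dictionary can be sketched as follows; the IP values are illustrative, and the same idea applies verbatim to device_id.

```python
from collections import Counter

# Count occurrences of each IP in the training data.
train_ips = ["1.2.3.4", "5.6.7.8", "1.2.3.4", "1.2.3.4"]
ip_counts = Counter(train_ips)

# Training feature: how often this row's IP appears overall.
train_feature = [ip_counts[ip] for ip in train_ips]
# Test feature: look up the same dictionary; IPs never seen in training get 0.
test_ips = ["1.2.3.4", "9.9.9.9"]
test_feature = [ip_counts.get(ip, 0) for ip in test_ips]
print(train_feature, test_feature)  # [3, 1, 3, 3] [3, 0]
```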

Now, let's analyze the device_id.

The same holds for device_id: devices used multiple times have a higher probability of being involved in fraud.

Hashing on training dataset

Now, this is the dataframe that we can use to compute feature importance.

As we can see, the added variables such as Number_of_times_IP_used_before, Number_of_times_device_ID_used_before, purchase_month, elapsed_time_weeks, etc. have significantly higher feature importance.

Training the Model

Let's prepare the test dataset

Since our data is imbalanced, we could apply various over-sampling and under-sampling methods to tackle this issue. But since we only have access to scikit-learn, we will stick with what we can achieve using it.
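One sklearn-only option worth noting is `class_weight="balanced"`, which re-weights classes inversely to their frequency instead of resampling; the synthetic data below just stands in for the real feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Synthetic imbalanced data (~10% positive) standing in for the fraud features.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
# class_weight="balanced" penalizes minority-class mistakes more heavily.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
print(recall_score(y, clf.predict(X)))  # training-set recall, for illustration
```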

Let's try different base models first

Random Forest

Here, the results are quite decent, but the recall is too low for our use case: a false positive costs us $8, while a false negative can cost much more, so we need higher recall.

Gradient Boosting

The same can be said about the result above: the recall is not good enough to use.

Random Forest

Now, let's try to balance the dataset using upsampling and analyze the results.
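Upsampling the minority class can be sketched with `sklearn.utils.resample`, drawing minority rows with replacement until both classes are the same size; the array sizes are illustrative.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 90 negatives, 10 positives.
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.array([0] * 90 + [1] * 10)

# Resample the minority class with replacement up to the majority size.
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=90, random_state=0)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))  # [90 90]
```

Note that upsampling must be done only on the training split, after the train/test split, or duplicated rows leak into the test set.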

Since we are working with an imbalanced dataset, anomaly detection is also worth trying.

Here, we can see that upsampling produced no significant change in the results, so let's move on to anomaly detection.
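One sklearn anomaly detector that fits this framing is `IsolationForest`: fit on the data and treat predicted outliers (label -1) as suspected fraud. Setting `contamination` near the observed fraud rate is my assumption; the 2-D synthetic data below stands in for the real features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 180 "normal" points plus 20 shifted "anomalies".
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(180, 2), rng.randn(20, 2) + 4])

# contamination ~ expected fraction of anomalies (the fraud rate is ~9%).
iso = IsolationForest(contamination=0.1, random_state=0)
pred = iso.fit_predict(X)  # -1 = anomaly, 1 = normal
print((pred == -1).sum())  # roughly 10% of points get flagged
```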

Here, the results are not impressive compared to Random Forest.

Now, let's try to implement AutoEncoders

This result is not great either, so we can conclude that Random Forest is our best option for this dataset.

Now, let's remove some unimportant features and try Random Forest again.

Here, after dropping some features, we see a slight drop in recall.

In terms of the precision-recall trade-off, there is not much we can do to improve precision without hurting recall.

Now, let's try to apply StandardScaler() on the numerical variables and check the results again.
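The scaling step can be sketched as follows; the two numerical columns are illustrative, and the scaler should be fit on the training split only and reused on the test split to avoid leakage. (Tree-based models are scale-invariant, so little change is expected.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical columns (e.g. age, purchase_value).
X_train = np.array([[25.0, 34.0], [40.0, 16.0], [35.0, 50.0]])
# Fit on train; the same fitted scaler would transform the test data.
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
print(X_scaled.mean(axis=0).round(6))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(6))   # ~[1. 1.]
```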

Well, applying StandardScaler doesn't improve the results. Now, let's try to optimize the hyperparameters using Optuna.

We first tried optimizing for accuracy; now let's optimize the model using PR AUC instead.

Here, the recall is the same as we got from stratified k-fold cross-validation, which suggests that our model is not overfitting.

Here, though our recall is quite low, we can lower the classification threshold to increase recall; this will decrease precision, but that is a trade-off we have to consider.
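The threshold trade-off can be sketched as follows: lowering the decision threshold on predicted probabilities raises recall at the cost of precision. The probabilities below are illustrative numbers, not actual model output.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.4, 0.55, 0.7, 0.9])

# Compare the default 0.5 threshold with a lower one.
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(threshold,
          recall_score(y_true, pred),      # rises as the threshold drops
          precision_score(y_true, pred))   # falls as the threshold drops
```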

Here, we get a PR AUC score of 0.51. PR AUC summarizes precision and recall across all thresholds rather than the fraction of frauds caught; since a no-skill classifier would score around the fraud rate (~0.09), 0.51 indicates a reasonably useful model.

Referred Links:
https://machinelearningmastery.com/feature-selection-with-categorical-data/
https://medium.com/adj2141/credit-card-fraud-detection-using-machine-learning-899af62df3ab
https://mlopshowto.com/detecting-financial-fraud-using-machine-learning-three-ways-of-winning-the-war-against-imbalanced-a03f8815cce9
https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
https://stackoverflow.com/questions/40739152/how-to-use-sklearn-featurehasher
https://analyticsindiamag.com/python-guide-to-precision-recall-tradeoff/
https://glassboxmedicine.com/2019/03/02/measuring-performance-auprc/
https://stats.stackexchange.com/questions/113326/what-is-a-good-auc-for-a-precision-recall-curve
http://qingkaikong.blogspot.com/2016/04/plot-histogram-on-clock.html